knitr::opts_chunk$set(
  #collapse = TRUE,
  comment = "",
  fig.width = 4,
  fig.height = 4,
    message = FALSE,
    warning = FALSE,
    tidy.opts = list(
        keep.blank.line = TRUE,
        width.cutoff = 150
        ),
    options(width = 150),
    eval = TRUE
)

Installing the protsummarize and protlocassign packages

This package contains functions to identify outlier spectra and peptides and then to compute weighted averages of the spectral profiles to produce a single profile for each protein. Before we describe how to use these functions, we must cover some preliminary steps needed to use them.

Start by installing the devtools package from CRAN, by typing

install.packages("devtools")

Then install the protsummarize and (if you haven't already) the protlocassign package from the github repository by typing

devtools::install_github("mooredf22/protsummarize")
devtools::install_github("mooredf22/protlocassign")

This will make the programs and data sets available, as well as installing needed libraries

The package requires that you attach these libraries:

library(outliers)
library(snowfall)
library(lme4)

These libraries include the lme4 package, which implements a set of routines for working with nested data; specifically, in this case, with the nesting of spectra within peptides and peptides within proteins. The outliers package implements a score-based outlier detection routine, which is one option we provide. The snowfall package implements multicore processing.

In addition to the protsummarize package described in this vignette, we also require the companion package protlocassign since it contains the spectral level data.

library(protlocassign)
library(protsummarize)

Spectral-level data

The dataset spectraNSA_AT5tmtMS2 from Tannous, et al., is embedded in the protlocassign package. For simplicity, we first re-name it and then view it.

spectraNSA <- spectraNSA_AT5tmtMS2
str(spectraNSA, strict.width="cut", width=65)

Note that in this file, the first five columns contain information associated with the spectra assignment (protein identifier, peptide sequence, unique spectral identifier, location of peptide in protein, and post-translational/chemical modifications), while the next nine columns contain normalized specific amount data. The final two columns contain numbered lists of distinct proteins and peptides; these two columns are needed by the outlier detection functions described in the next section. The last section of this vignette explains how to read in data from an external text file and describes a function proteinDataPrep that aids in putting the data into the above-described format.

Removal of outliers

The first step is an outlier screen, which is performed at the spectrum level. For each spectrum, the normalized specific amount for each of the nine fractions is first log2 transformed via $y=log2(c+\epsilon)$ , where $\epsilon$ is an estimate of the background intensity in the abundance estimate $c$. We estimated $\epsilon$ as the median of the normalized specific amount of an isobaric labeling channel that lacks the bacterial spike in standard DrR57, which is known to be absent in mammalian cells, and was added to all other channels. For the TMT-MS2 experiment we found $\epsilon = 0.029885$. The package provides a choice of two different outlier rejection methods that can be invoked to process the data. The “boxplot” method rejects any value that exceeds a specified multiple of the interquartile range below the first or above the third quartile. This method was used in Tannous et al. (2020) using three times the interquartile range, which we invoke here by specifying outlierMeth="boxplot" and range=3. The “scores” method used in Jadot et al (2017) calculates normal scores (differences between each value and the mean of those values, divided by their standard deviation) and rejects values in the tails of a standard normal distribution for a specified percentile, such as 0.99. It may be invoked by specifying outlierMeth="scores" and proba=0.99. Note that the outlier rejection procedure is conducted on each channel, and the number of outliers for each spectrum are reported in the variable outlier.num.spectra, which can range from zero (no outliers) to the number of fractions (in this case 9). This process is repeated for every peptide from every protein. We reject any spectrum where outlier.num.spectra > 0.

We first process the data to identify outliers when grouping spectra to peptides using the function outlierFind and specifying that the outlier level is "peptide", meaning that profiles with outlier spectra within peptides (accepting all peptides unless they are composed entirely of spectra with outliers) will be flagged. The arguments numRefCols = 5 and numDataCols = 9 are used to identify position of the nine-element profile in the data frame. The argument randomError=T (which is the default) specifies that a small random number (uniform distribution ranging from zero to eps) will be added to profile elements that take the value zero. This addresses the problem that, if there are a large number of zero-valued elements, both the first and third quartile, and thus the interquartile range, will be zero. Without adding these random small amounts, too many "outliers" would be detected. The argument cpus=1 specifies that one CPU will be used for multiprocessing; we do this, and specify a random number seed, so that the results are exactly reproducible. Of course, a higher number may be specified if more CPU's are available, and if exact reproducibility is not required.

# We set the random number seed to ensure consistent output in the vignette; 
#  users should omit this or select their own seed
set.seed(17356)    
eps <- 0.029885209
flagSpectraBox <- outlierFind(protClass=spectraNSA, 
              outlierLevel="peptide", numRefCols=5, numDataCols=9, 
              outlierMeth="boxplot", range=3, eps=eps, 
              cpus=1, randomError=T)

Alternatively, we can specify using the "scores" method as follows. Note that the argument "range" is only used when outlierMeth="boxplot", and ignored otherwise. Similarly, the "proba" argument is only used when outlierMeth="scores"

set.seed(27681)  
flagSpectraScore <- outlierFind(protClass=spectraNSA, 
                outlierLevel="peptide", numRefCols=5, numDataCols=9, 
                outlierMeth="scores", proba=0.99,
                eps=eps, cpus=1, randomError=T)

We then examine the output files from the boxplot and scores methods:

str(flagSpectraBox, strict.width="cut", width=65)
str(flagSpectraScore, strict.width="cut", width=65)

We can determine the total number of spectra as well as the number of acceptable (e.g. non-outlier) and outlier spectra for both methods using the following statements:

length(flagSpectraBox$outlier.num.spectra)    # total number of spectra
sum(flagSpectraBox$outlier.num.spectra == 0)  # number of acceptable spectra
sum(flagSpectraBox$outlier.num.spectra != 0)  # number of outlier spectra

Similarly, we can determine the total number of spectra and the number of acceptable spectra for the scores method using the following statements:

length(flagSpectraScore$outlier.num.spectra)    # total number of spectra
sum(flagSpectraScore$outlier.num.spectra == 0)  # number of acceptable spectra
sum(flagSpectraScore$outlier.num.spectra != 0)  # number of outlier spectra

Note that in both examples shown above, the randomError argument was used (randomError=T). The reason for invoking this argument is illustrated in the two examples below where we set the argument to False or True when we use the boxplot method with a very large range (range=110) where we intuitively would expect that there would be very few outliers. However, if we use the randomError=F argument we find that there are 56 outliers:

flagSpectraBoxRange110ranErrF <- outlierFind(protClass=spectraNSA, 
              outlierLevel="peptide", numRefCols=5, numDataCols=9, 
              outlierMeth="boxplot", range=110, eps=eps, 
              cpus=4, randomError=F)
sum(flagSpectraBoxRange110ranErrF$outlier.num.spectra != 0)

We have identified 56 outliers because, as discussed above, there are some peptides with enough 0 values that the interquartile range is 0. We can fix this by specifying randomError=T, which reduces the number of outliers. Note in the example below there are six outliers. With repeated runs of outlierFind, unless the same random number seed is set and only one cpu is specified, the random error will change, which will result in slightly varying outcomes.

set.seed(652908)  # as before, the seed is an arbitrary number used by the random number generator
flagSpectraBoxRange110ranErrT <- outlierFind(protClass=spectraNSA, 
              outlierLevel="peptide", numRefCols=5, numDataCols=9, 
              outlierMeth="boxplot", range=110, eps=eps, cpus=1, randomError=T)
sum(flagSpectraBoxRange110ranErrT$outlier.num.spectra != 0)

The procedure described is useful in eliminating outlying spectra for estimates of peptide profiles. The next step is to create mean profiles for each peptide, and also to identify outlying peptide profiles.

Compute means for each protein-peptide combination

The function profileSummarize with option GroupBy="peptideId" computes the mean profile of each peptide, eliminating spectra that have been identified as outliers. The input data, flagSpectraBox contains six reference columns (see above), specified by numRefCols=6. A portion of these can be kept and carried over to the output file, but note that since the function here is producing a profile for each peptide from its relevant component spectra, some of the reference columns from the input data may not be appropriate. Here, the argument refColsKeep=c(1,2,4) causes the function to retain reference columns 1, 2, and 4, which correspond to "prot", "peptide", and "PeptidesStartPositionInProtein". If a field is kept which has multiple values across all spectra that contribute to a given peptide, only the single value from the first of these spectra is used. The argument refColsKeep is only used in the case where GroupBy is set to peptideId, and if not specified, the default is to keep only the first two columns. We use the argument outlierExclude="spectra" to indicate that, when computing mean profiles, we exclude spectra that are identified as outliers within their respective peptides. Note that in the output, Nspectra reports the number of spectra (in this case, all non-outlier spectra) used to calculate the mean peptide profile.

pepProfiles <- profileSummarize(protsCombineCnew=flagSpectraBox,
                numRefCols=6, numDataCols=9, refColsKeep=c(1,2,4),eps=eps,
                GroupBy="peptideId", outlierExclude="spectra")
str(pepProfiles, strict.width="cut", width=65)

Next, using this output, we use the function outlierFind to identify peptide outliers within proteins. We do this by specifying outlierLevel to be "protein" and do not need to specify that we use the default boxplot method with range=3 and the randomError=T. Note that in this procedure, all peptides in a given protein are weighted equally, regardless of how many non-outlier spectra are used to determine the average peptide profile.

set.seed(823537)
flagPepsBox <- outlierFind(protClass=pepProfiles,            
                    outlierLevel="protein", 
                    numRefCols=3, numDataCols=9, eps=eps, cpus=1)
str(flagPepsBox, strict.width="cut", width=65)

We determine the total number of peptides and the number of acceptable and outlier peptides using the following statements:

length(flagPepsBox$outlier.num.peptides)
sum(flagPepsBox$outlier.num.peptides == 0)
sum(flagPepsBox$outlier.num.peptides != 0)

We can now merge these peptide-level outliers into the full spectral level file. Here, we create a new dataframe, flagSpectraPeps, by adding outlier.num.peptides (column 4 from flagPepsBox) to flagSpectraBox, merging on pepId (column 17 from flagPepsBox).

data.frame(names(flagPepsBox))
flagSpectraPeps <- merge(x=flagSpectraBox, 
                        y=flagPepsBox[,c(4,17)], by="pepId")
str(flagSpectraPeps,  strict.width="cut", width=65)

We determine the number of spectra in the full data set and the number of acceptable spectra from acceptable peptides using the following statements:

length(flagSpectraPeps$outlier.num.peptides)
sum({flagSpectraPeps$outlier.num.spectra == 0} & 
      {flagSpectraPeps$outlier.num.peptides == 0})

We now calculate the mean profile for each protein, using a random effects linear model where spectra are nested within peptides. In the three examples listed below, we examine cases where we (1) reject the contribution of any outlying spectra and peptides, (2) reject the contribution of just outlier spectra within peptides, and (3) we don’t reject any peptides or spectra. Here, the outlierExclude argument overrides the outlier specifications in the flagSpectraPeps file. The resulting file contains the protein profile, indicating the number of peptides and spectra used to determine each protein profile using the rules above. The root portion of the name protNSA denotes that these values are normalized specific amounts (NSAs) for protein profiles. Note that the option cpus=4 is the default, but any number of cpu's can be specified.

protNSA_1 <- 
          profileSummarize(protsCombineCnew=flagSpectraPeps,
          numRefCols=7, numDataCols=9, eps=eps, GroupBy="protId",
          outlierExclude="spectraAndPeptide", cpus=4)
str(protNSA_1,  strict.width="cut", width=65)
protNSA_2 <- 
          profileSummarize(protsCombineCnew=flagSpectraPeps,
          numRefCols=7, numDataCols=9, eps=eps, GroupBy="protId",
          outlierExclude="spectra", cpus=4)
str(protNSA_2,  strict.width="cut", width=65)
protNSA_3 <- 
          profileSummarize(protsCombineCnew=flagSpectraPeps,
          numRefCols=7, numDataCols=9, eps=eps, GroupBy="protId",
          outlierExclude="none", cpus=4)
str(protNSA_3,  strict.width="cut", width=65)

When comparing the total numbers of peptides and spectra used to calculate protein profiles, we see that protNSA_3 > protNSA_2 > protNSA_1 for each:

sum((protNSA_3$Nspectra))
sum((protNSA_2$Nspectra))
sum((protNSA_1$Nspectra))

sum((protNSA_3$Npep))
sum((protNSA_2$Npep))
sum((protNSA_1$Npep))

Note that even though we only reject spectra in protNSA_2, there is still one rejected peptide because all spectra assigned to this peptide were outliers.

We can examine the number of proteins with missing profile values (caused by outlier rejection) in each of the three cases by choosing one, say, "M"; all other channels will also be missing.

sum(is.na(protNSA_1$M))
sum(is.na(protNSA_2$M))
sum(is.na(protNSA_3$M))

We see that if we reject all outlier spectra and peptides (protNSA_1) there is only one protein with missing profiles. If we only reject outlier spectra within peptides (protNSA_2), there are no proteins with missing profiles. Be aware, though, that the outlier rejection process includes a random component, so that another run could lead to proteins with missing profiles, and we in fact have seen this using different runs when not specifically setting the random number generator seed on the embedded data set.

Taking the first option (rejecting outlier spectra within peptides as well as outlier peptides) as the preferred method, we then create a protein profile dataframe in the same format as used in earlier vignettes by moving the protein names from a variable within the dataframe to row names.

protNSA_new <- protNSA_1 
#copy first field in dataframe containing protein names to rownnames field
rownames(protNSA_new) <- protNSA_1[,1]  
protNSA_new <- protNSA_new[,-1]  #delete first internal column containing protein names

The dataframe protNSA_new is essentially identical to the data set protNSA_AT5tmtMS2 included in the protlocassign package. (The argument countEQ=T causes the function to compute the mean error across all profiles, not just among those that differ. )

all.equal(protNSA_AT5tmtMS2, protNSA_new, countEQ=T, tolerance=0)

Creating a combined protein/peptide profile data frame

In this section we describe a function protPepProfile that creates a data frame with each protein profile followed by the peptide profiles associated with each protein and their outlier status. The function requires two arguments. The first is a data frame of peptide profiles specified by flagPeps = flagPepsBox. The second is a data frame of protein profiles specified by protProfileData = protNSA_1; this data frame must have prot in the first column and the profiles in the following numDataCols columns. These two data frames must be in a consistent form and use the same type of profile data (e.g., NSA), as created earlier in this vignette. We first re-examine the list of column names of flagPepsBox to determine that there are four reference columns preceding the nine columns of profile data:

data.frame(names(flagPepsBox))
protPepProfileNSA <- protPepProfile(flagPeps=flagPepsBox,
                                 numRefCols=4, numDataCols=9, 
                                 protProfileData=protNSA_1)
str(protPepProfileNSA)

Finally, for later CPA analysis, we need the dataframe to contain a unique rowname for each profile, which we obtain from the peptide column. Here, for peptides this is a concatenation of the protein name and peptide sequence while for proteins this is the protein name alone.

rownames(protPepProfileNSA) <- protPepProfileNSA$peptide

We examine the first few row names here:

head(rownames(protPepProfileNSA))

(Note that this data frame is already embedded in the protsummarize package with the name protPepNSA_AT5tmtMS2.)

all.equal(protPepProfileNSA, protPepNSA_AT5tmtMS2, countEQ=T, tolerance=0)

We may use it to calculate CPA estimates with the fitCPA function. These calculations, as well as transformations, are discussed in Vignette 7.

Reading in spectral-level data from an external file

The external data file must have the following structure. For each spectrum, the first and second columns must be a protein and peptide identifier. The next columns can contain ancillary information (e.g., a unique spectral identifier, the position of a peptide within a protein, etc) followed by NSA profiles. For example, suppose your data is in a comma-delimited file “myProteinData.csv” in the subdirectory “C:\temp\myProteinInput”. You may read it into R as follows; the str command provides a look at the structure of the data that you have read in.

mySpectraData <- read.csv("C:\\temp\\mySpectraInput\\mySpectraData.txt")
str(mySpectraData, strict.width="cut", width=65)

As an example, here are commands to read in the Tannous AT5 TMT-MS2 data from an external file. This file is distributed with the protsummarize package as a comma-delimited file. Access the location of the data and view that location as follows:

AT5tmtMS2fileName <- system.file("extdata", "AT5tmtMS2_NSA.csv", package="protsummarize")

Now that we have the full file name in the variable AT5tmtMS2fileName, we can read it in and examine it as follows:

spectraNSA_AT5tmtMS2_temp <- read.csv(AT5tmtMS2fileName)
str(spectraNSA_AT5tmtMS2_temp)

If the protein names (which are in column 1) are prepended by a single quote (as we do to prevent Excel from interpreting some protein names as dates), it may be removed by using the sub function to substitute the initial quote (denoted by "^'") with an empty character ("") as shown here:

spectraNSA_AT5tmtMS2_temp[,1] <- sub("^'", "", spectraNSA_AT5tmtMS2_temp[,1])

The function proteinDataPrep is available to aid in formatting a file in the required form. This takes input data, sorts it by the first and second columns (which are renamed prot and peptide), appends all remaining columns (retaining their names), and generates a unique numerical ID for prot and peptide. We specify numRefCols=5 because that is the number of columns immediately preceding the profile data, and numDataCols=9 because that is the number of fractions in each profile.

spectraNSA_new <- proteinDataPrep(spectraNSA_AT5tmtMS2_temp, numRefCols=5,
         numDataCols=9)
str(spectraNSA_new, strict.width="cut", width=65)

This data set is identical to the spectral-level data set spectraNSA_AT5tmtMS2 included with the protsummarize package:

all.equal(spectraNSA_new, spectraNSA_AT5tmtMS2)

Note that in some cases, one may want to change the peptide identifier to include additional information other than the actual sequence such as presence of post-translational modifications. In this case, one would create a new variable PepMod prior to running proteinDataPrep, and then insert that new column after the first column, which contains the protein name.

References

Jadot M, Boonen M, Thirion J, Wang N, Xing J, Zhao C, Tannous A, Qian M, Zheng H, Everett JK, Moore DF, Sleat DE, Lobel P (2016) Accounting for protein subcellular localization: a compartmental map of the rat liver proteome. Molecular and Cellular Proteomics 16, 194-212. doi:10.1074/mcp.M116.064527 PMCID: PMC5294208

Tannous A, Boonen M, Zheng H, Zhao C, Germain C, Moore D, Sleat D, Jadot M, Lobel P. (2020) Comparative Analysis of Quantitative Mass Spectrometric Methods for Subcellular Proteomics. Journal of Proteome Research. Journal of Proteome Research 19, 1718-1730. doi: 10.1021/acs.jproteome.9b00862



mooredf22/protsummarize2 documentation built on May 16, 2021, 10:12 p.m.